ABSTRACT

The amount of concurrency available in conjugate gradient iteration is
limited by the summations required in the inner product computations. The
inner product of two vectors of length N requires time c*log(N), if N or
more processors are available.

This paper describes an algebraic restructuring of the conjugate gradient
algorithm which minimizes data dependencies due to inner product
calculations. After an initial start up, the new algorithm can perform a
conjugate gradient iteration in time c*log(log(N)).

residence at ICASE, NASA Langley Research Center, Hampton, VA 23665.


1. Introduction


Conjugate gradient iteration is a method of linear equation solution of
great practical importance. See, for example, Hestenes and Stiefel [4],
Concus, Golub and O'Leary [3], or Chandra [2]. It can be used to solve any
linear system

Au = b

where A is symmetric, positive definite, and can be quite efficient when
coupled with various preconditioning techniques. However, CG (conjugate
gradient) iteration involves the computation of inner products at every
iteration. On parallel computers with large numbers of processors, the data
dependencies inherent in these inner product calculations will limit the speed
of conjugate gradient iteration for large sparse linear systems. This is
pointed out in Schreiber* and Adams [1]. In fact, given sufficiently
many processors, the summation fan-ins in the inner product calculations will
dominate the computation time on nearly all large sparse linear systems
occurring in practice.

*Schreiber, R., 1983, Stanford University, California, personal communication;
Schreiber, R., 1981, SIAM J. Sci. Statist. Comput., to be published.

2. Conjugate Gradient Iteration

This paper presents a solution to this problem through an algebraic
restructuring of the CG algorithm. Consider first the standard CG
iteration. One of a number of mathematically equivalent forms of it may be
given as follows:
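One commonly used form of the standard CG iteration can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not necessarily the exact form intended here: the helper names `dot`, `matvec` and `cg`, the zero starting guess, and the placement of the step parameter `lam` and the direction parameter `alpha` within the loop are all choices made for this sketch.

```python
def dot(x, y):
    """Inner product of two vectors stored as Python lists."""
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [dot(row, x) for row in A]


def cg(A, b, iters):
    """Standard CG for symmetric positive definite A, starting from u = 0."""
    u = [0.0] * len(b)
    r = list(b)              # residual r = b - A u, with u = 0
    p = list(r)              # first search direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr == 0.0:        # already converged exactly
            break
        Ap = matvec(A, p)
        lam = rr / dot(p, Ap)                 # needs the inner product (p, Ap)
        u = [ui + lam * pi for ui, pi in zip(u, p)]
        r = [ri - lam * ai for ri, ai in zip(r, Ap)]
        rr_new = dot(r, r)                    # needs the inner product (r, r)
        alpha = rr_new / rr
        p = [ri + alpha * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return u
```

Each pass through the loop waits on two inner products, `dot(p, Ap)` and `dot(r, r)`, before the next vectors can be formed.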
The data dependencies here are severe. One cannot generate [...] until
$\alpha_{n-1}$ and $\lambda_{n-1}$ are known. But these quantities involve inner
products dependent on [...]. As pointed out above, an inner product on
vectors of length N requires time c*log(N). Thus it would seem that a CG
iteration could not be done faster than in time c*log(N).
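The c*log(N) figure comes from the summation fan-in: with enough processors, all additions in one stage of a pairwise sum proceed at once, and the number of stages grows like log2(N). The following minimal sketch counts those stages; the function name `tree_sum` is hypothetical and the input is assumed to be non-empty.

```python
def tree_sum(values):
    """Sum by pairwise fan-in; return (total, number of parallel stages)."""
    level = list(values)
    stages = 0
    while len(level) > 1:
        # All additions within one stage are independent of each other,
        # so with enough processors a stage costs one time step.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # odd element is carried to the next stage
        level = nxt
        stages += 1
    return level[0], stages
```

An inner product of two length-N vectors adds one stage of elementwise multiplications in front of this fan-in.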


3. Idea of New Algorithm 

This natural seeming idea, that a CG iteration on vectors of length N
cannot be done faster than in time c*log(N), turns out to be incorrect. To
see why, consider the computation of a typical inner product required,
$(r^{(n)}, r^{(n)})$.

By the formulas above, $r^{(n)}$ is given as

\[ r^{(n)} = r^{(n-1)} - \lambda_{n-1} A p^{(n-1)}. \]

Now suppose we know $r^{(n-1)}$ and $p^{(n-1)}$ but not the value of
$\lambda_{n-1}$. In this case we would be unable to evaluate $r^{(n)}$, but we
could still perform most of the work involved in evaluating this inner product.
Specifically, we can write the recurrence

\[ (r^{(n)}, r^{(n)}) = (r^{(n-1)}, r^{(n-1)}) - 2\lambda_{n-1} (r^{(n-1)}, A p^{(n-1)}) + \lambda_{n-1}^{2} (A p^{(n-1)}, A p^{(n-1)}) \]

and can proceed to evaluate all inner products on the right here. If
subsequently someone told us the value of $\lambda_{n-1}$ we could compute the
value of $(r^{(n)}, r^{(n)})$ very rapidly, since only a few more real
operations would then be needed to complete evaluation of the recurrence
relation.

It is easy to see how this idea can be used to speed the computation of 
the CG algorithm on parallel computers. We have replaced an inner product 
computation requiring data not present until iteration n with inner products 
of vectors present at iteration n-1. Since these vectors are present sooner, 
we have that much longer to perform their inner products, to achieve the same 
parallel computation speed. Stated differently, assuming only the inner
products limit the speed of the computation, the use of this recurrence
relation for $(r^{(n)}, r^{(n)})$ and the analogous relation for
$(p^{(n)}, A p^{(n)})$ will approximately double the parallel speed of CG
iteration, where it is assumed that sufficiently many processors are
available and that communications cost can be neglected.
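The one-step idea can be checked numerically. The sketch below assumes the residual update r(n) = r(n-1) - lam * A p(n-1); the function names and the small test data are made up for illustration. The three inner products are formed from vectors of iteration n-1 alone, and the value of (r(n), r(n)) is completed with a few scalar operations once `lam` is known.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [dot(row, x) for row in A]


def early_products(A, r_prev, p_prev):
    """Inner products that need only the vectors of iteration n-1."""
    Ap = matvec(A, p_prev)
    return dot(r_prev, r_prev), dot(r_prev, Ap), dot(Ap, Ap)


def finish_rr(products, lam):
    """Complete (r(n), r(n)) with scalar operations once lam is known."""
    rr, r_Ap, Ap_Ap = products
    return rr - 2.0 * lam * r_Ap + lam * lam * Ap_Ap


# Small illustrative data (not taken from the report).
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
r_prev = [1.0, -2.0, 0.5]
p_prev = [0.5, 1.0, -1.0]

products = early_products(A, r_prev, p_prev)   # can start before lam exists
lam = 0.3                                      # becomes available later
via_recurrence = finish_rr(products, lam)

# Reference value: form r(n) explicitly, then take its inner product.
r_new = [ri - lam * ai for ri, ai in zip(r_prev, matvec(A, p_prev))]
direct = dot(r_new, r_new)
```

Here `via_recurrence` and `direct` agree up to rounding, while only `direct` requires a length-N summation after `lam` is known.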
4. Recurrence Relations

The recurrence relation just described is one of a large class of such 
relations which can be exploited to speed up CG iteration. These relations 
will be given in detail in a future paper, but for now we consider only the 
general form of such recurrence relations. Consider the typical inner 
product: 

\[ (r^{(n)}, r^{(n)}) \]

The value of this inner product may be given in terms of the values of inner
products of vectors occurring at any previous iteration together with the
values of the real parameters

\[ \alpha_{n-1}, \alpha_{n-2}, \ldots, \qquad \lambda_{n-1}, \lambda_{n-2}, \ldots \]


For example, for any k > 0, one can derive recurrence relations of the form

\[ (r^{(n)}, r^{(n)}) = \sum_{i=0}^{2k} a_i \, (r^{(n-k)}, A^i r^{(n-k)}) + \sum_{i=0}^{2k} b_i \, (r^{(n-k)}, A^i p^{(n-k)}) + \sum_{i=0}^{2k} c_i \, (p^{(n-k)}, A^i p^{(n-k)}). \qquad (*) \]

The coefficients $\{a_i\}$, $\{b_i\}$, $\{c_i\}$ occurring here are polynomials
in the parameters

\[ \{\alpha_{n-1}, \alpha_{n-2}, \ldots, \alpha_{n-k}, \lambda_{n-1}, \lambda_{n-2}, \ldots, \lambda_{n-k}\}. \]

Similar recurrence relations are available for the other type of inner product
occurring in CG iteration, $(p^{(n)}, A p^{(n)})$.
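As a worked instance, take k = 1. Assuming the residual update $r^{(n)} = r^{(n-1)} - \lambda_{n-1} A p^{(n-1)}$ and a symmetric A, and writing $b_i$ and $c_i$ for the coefficients of $(r^{(n-k)}, A^i p^{(n-k)})$ and $(p^{(n-k)}, A^i p^{(n-k)})$ respectively (this labeling is an assumption of the sketch), the expansion reads:

```latex
\begin{align*}
(r^{(n)}, r^{(n)})
  &= \bigl(r^{(n-1)} - \lambda_{n-1} A p^{(n-1)},\;
           r^{(n-1)} - \lambda_{n-1} A p^{(n-1)}\bigr) \\
  &= (r^{(n-1)}, r^{(n-1)})
     - 2\lambda_{n-1}\,(r^{(n-1)}, A p^{(n-1)})
     + \lambda_{n-1}^{2}\,(p^{(n-1)}, A^{2} p^{(n-1)}).
\end{align*}
```

So for k = 1 the only nonzero coefficients are $a_0 = 1$, $b_1 = -2\lambda_{n-1}$ and $c_2 = \lambda_{n-1}^{2}$, each a polynomial of degree at most two in $\lambda_{n-1}$, and the highest power of A that appears is $A^{2k} = A^{2}$.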


5. New Algorithm

To construct a more parallel variant of CG iteration based on these
recurrence relations, one begins by selecting a value for the constant k,
which may be thought of as a look-ahead parameter. Then at iteration n - k,
when vectors $r^{(n-k)}$ and $p^{(n-k)}$ become available, one begins forming
all of the inner products

\[ (r^{(n-k)}, A^i r^{(n-k)}), \quad i = 0, 1, \ldots, 2k, \]
\[ (r^{(n-k)}, A^i p^{(n-k)}), \quad i = 0, 1, \ldots, 2k, \]
\[ (p^{(n-k)}, A^i p^{(n-k)}), \quad i = 0, 1, \ldots, 2k. \]

The values of these inner products are needed in the recurrence relations for
the inner products

\[ (r^{(n)}, r^{(n)}), \qquad (p^{(n)}, A p^{(n)}) \]

at iteration n. Thus we arrive at an algorithm whose data movements are
sketched in Figure 1.

[Figure 1 is not reproduced; its one surviving label reads "inner product calculations".]

Figure 1. Principal Data Movement in New CG Algorithm.


Clearly the problem of the delays caused by the summations in the inner
products is now solved. If we choose k = log(N), the inner product summation
delays will have no impact on algorithm speed. However, two new issues now
arise. First, we have not dealt with the way in which the parameters

\[ \{\alpha_{n-1}, \alpha_{n-2}, \ldots, \alpha_{n-k}, \lambda_{n-1}, \lambda_{n-2}, \ldots, \lambda_{n-k}\} \]

enter into the recurrence relations. In principle, there could be severe data
dependencies here. Second, there seem to be a large number of inner products
required now, most involving a relatively high power of the matrix A.

Neither of these problems is as serious as it first appears. For the
first, it turns out the coefficients $\{a_i\}$, $\{b_i\}$, $\{c_i\}$ in the
recurrence relations above are polynomials in the parameters

\[ \{\alpha_{n-1}, \alpha_{n-2}, \ldots, \alpha_{n-k}, \lambda_{n-1}, \lambda_{n-2}, \ldots, \lambda_{n-k}\} \]

which are at most quadratic in each parameter separately. This fact, coupled
with the observation that the parameters

\[ \alpha_{n-k}, \alpha_{n-k+1}, \ldots, \qquad \lambda_{n-k}, \lambda_{n-k+1}, \ldots \]

gradually become available, enables us to effectively perform the coefficient
evaluations in a pipelined fashion. Thus at iteration n, when we need the
inner product $(r^{(n)}, r^{(n)})$, we can have the recurrence relation (*)
completely evaluated, except for performing the summations, or the analogous
summations in the recurrence for $(p^{(n)}, A p^{(n)})$. This requires parallel
time

log(k) = log(log(N)).
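The look-ahead evaluation can be sketched as follows. This is an illustrative reading, not the author's implementation: it assumes the updates r <- r - lam * A p followed by p <- r + alpha * p, a symmetric matrix A, and hypothetical names (`moments`, `lookahead_rr`). The current residual and direction are tracked as polynomials in A applied to the vectors of iteration n-k, the polynomial coefficients are updated as each parameter pair arrives, and the final value is a short sum over inner products formed k iterations earlier.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [dot(row, x) for row in A]


def moments(A, x, y, m):
    """Return [(x, A^i y) for i = 0..m]."""
    out, v = [], list(y)
    for _ in range(m + 1):
        out.append(dot(x, v))
        v = matvec(A, v)
    return out


def lookahead_rr(RR, RP, PP, lams, alphas):
    """(r(n), r(n)) from inner products formed at iteration n-k.

    RR[i] = (r, A^i r), RP[i] = (r, A^i p), PP[i] = (p, A^i p), i = 0..2k,
    all taken at iteration n-k; lams[j], alphas[j] belong to step j of k.
    """
    k = len(lams)
    # Coefficients of A^i r(n-k) and A^i p(n-k) in the current r and p.
    rho_r, pi_r = [1.0] + [0.0] * k, [0.0] * (k + 1)
    rho_p, pi_p = [0.0] * (k + 1), [1.0] + [0.0] * k
    for lam, alpha in zip(lams, alphas):
        # r <- r - lam * A p: multiplying by A shifts coefficients up by one.
        rho_r = [rho_r[i] - lam * (rho_p[i - 1] if i else 0.0) for i in range(k + 1)]
        pi_r = [pi_r[i] - lam * (pi_p[i - 1] if i else 0.0) for i in range(k + 1)]
        # p <- r + alpha * p
        rho_p = [rho_r[i] + alpha * rho_p[i] for i in range(k + 1)]
        pi_p = [pi_r[i] + alpha * pi_p[i] for i in range(k + 1)]
    total = 0.0
    for i in range(k + 1):
        for j in range(k + 1):
            total += rho_r[i] * rho_r[j] * RR[i + j]
            total += 2.0 * rho_r[i] * pi_r[j] * RP[i + j]
            total += pi_r[i] * pi_r[j] * PP[i + j]
    return total
```

The products of coefficients in the final double loop are at most quadratic in each parameter, and the remaining work after the last parameter arrives is a sum over a number of terms that depends on k rather than on N.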

The second problem mentioned above, the occurrence of high powers of the
matrix A in the recurrence relation (*), can be resolved by the use of
additional recurrence relations. First, observe that there is no need to
compute powers of the matrix A, since we have the recurrences:

\[ A^i r^{(n)} = A^i r^{(n-1)} - \lambda_{n-1} A^{i+1} p^{(n-1)}, \]
\[ A^i p^{(n)} = A^i r^{(n)} + \alpha_{n-1} A^i p^{(n-1)}. \]

Thus the set of vectors $\{A^i r^{(n)}\}$, $\{A^i p^{(n)}\}$ can be updated
with only one matrix vector product.
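A minimal sketch of this update follows, assuming the two vector recurrences take the form A^i r_new = A^i r - lam * A^(i+1) p and A^i p_new = A^i r_new + alpha * A^i p. The names `powers` and `update_powers` are hypothetical, and at least two powers (m >= 1) are assumed to be stored.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [dot(row, x) for row in A]


def powers(A, v, m):
    """Reference helper: [v, A v, ..., A^m v] by repeated multiplication."""
    out = [list(v)]
    for _ in range(m):
        out.append(matvec(A, out[-1]))
    return out


def update_powers(A, R, P, lam, alpha):
    """Advance R[i] = A^i r and P[i] = A^i p (i = 0..m) by one iteration.

    Only one matrix-vector product is used, for the highest power of r.
    """
    m = len(R) - 1
    # A^i r_new = A^i r - lam * A^(i+1) p is available for i = 0..m-1.
    R_new = [[ri - lam * pi for ri, pi in zip(R[i], P[i + 1])] for i in range(m)]
    # The top power is the single matrix-vector product of the update.
    R_new.append(matvec(A, R_new[m - 1]))
    # A^i p_new = A^i r_new + alpha * A^i p for i = 0..m.
    P_new = [[ri + alpha * pi for ri, pi in zip(R_new[i], P[i])] for i in range(m + 1)]
    return R_new, P_new
```

All other entries are obtained by vector additions and scalings, which are fully parallel across vector components.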

Next observe that nearly all of the inner products needed can also be
obtained by recurrences, with similar recurrences holding for each type of
inner product occurring in relation (*). Given the values of these inner
products at iteration n, we can obtain nearly all of the inner products
needed at iteration n+1. Only two inner products need to be computed directly.
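One such inner-product recurrence can be illustrated for the family (r, A^i r). The sketch assumes a symmetric A and the residual update r_new = r - lam * A p, from which (r_new, A^i r_new) = (r, A^i r) - 2 lam (r, A^(i+1) p) + lam^2 (p, A^(i+2) p) follows. The function names are hypothetical, and this is one derivable relation rather than a transcription of the report's formula.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [dot(row, x) for row in A]


def moments(A, x, y, m):
    """Reference helper: [(x, A^i y) for i = 0..m] by explicit products."""
    out, v = [], list(y)
    for _ in range(m + 1):
        out.append(dot(x, v))
        v = matvec(A, v)
    return out


def rr_moments_next(RR, RP, PP, lam):
    """New values of (r, A^i r) from old scalars only (no vector work).

    RR[i] = (r, A^i r), RP[i] = (r, A^i p), PP[i] = (p, A^i p) at the
    previous iteration. The result covers i = 0..len(RR)-3, because
    entry i uses RP[i + 1] and PP[i + 2].
    """
    return [RR[i] - 2.0 * lam * RP[i + 1] + lam * lam * PP[i + 2]
            for i in range(len(RR) - 2)]
```

The highest-order entries are out of reach of the scalar recurrence, which is consistent with the remark that a small number of inner products must still be computed directly at each iteration.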


6. Computational Complexity 

As pointed out above, the summations in the recurrence relations (*)
require time

log(k) = log(log(N)).

Thus if matrix A has at most d nonzeroes per row or column, this algorithm
requires parallel time

max(log(d),log(log(N))).

The sequential complexity of this algorithm is essentially the same as that of
the usual CG algorithm; we still need two inner products and a matrix vector
product at every iteration.
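To give a feel for the sizes involved, the two bounds can be compared numerically. In this sketch the constant c is dropped, logarithms are taken base 2, and the figure max(log(d), log(N)) for the standard iteration is an assumption inferred from the c*log(N) inner product time stated earlier; both function names are hypothetical.

```python
import math


def lookahead_step_time(N, d):
    """Parallel time per iteration of the restructured algorithm (c omitted)."""
    return max(math.log2(d), math.log2(math.log2(N)))


def standard_step_time(N, d):
    """Assumed parallel time per iteration of standard CG (c omitted)."""
    return max(math.log2(d), math.log2(N))


# Example: N = 2**16 unknowns, at most d = 4 nonzeroes per row.
new_time = lookahead_step_time(2 ** 16, 4)   # max(2, 4)  = 4.0
old_time = standard_step_time(2 ** 16, 4)    # max(2, 16) = 16.0
```

For sparse systems with small d, the log(log(N)) term governs the restructured iteration, while the standard iteration is governed by log(N).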





REFERENCES 

[1] Adams, L. [1982]. "Iterative Algorithms for Large Sparse Linear Systems
on Parallel Computers," NASA Contractor Report 166027, NASA Langley
Research Center.

[2] Chandra, R. [1978]. "Conjugate Gradient Methods for Partial Differential
Equations," Ph.D. Thesis, Research Report #129, Department of Computer
Science, Yale University.

[3] Concus, P., Golub, G. and O'Leary, D. [1976]. "A Generalized Conjugate
Gradient Method for the Numerical Solution of Elliptic Partial
Differential Equations," Sparse Matrix Computations, eds. J. Bunch, D.
Rose, Academic Press, pp. 309-332.

[4] Hestenes, M., and Stiefel, E. [1952]. "Methods of Conjugate Gradients
for Solving Linear Systems," J. Res. Nat. Bur. Std., pp. 409-426.



1. Report No.: NASA CR-172178
4. Title and Subtitle: Minimizing Inner Product Data Dependencies in Conjugate Gradient Iteration
5. Report Date: July 1983
7. Author(s): John Van Rosendale
8. Performing Organization Report No.: 83-36
9. Performing Organization Name and Address: Institute for Computer Applications in Science and Engineering, Mail Stop 132C, NASA Langley Research Center, Hampton, VA 23665
11. Contract or Grant No.: NAS1-17130
12. Sponsoring Agency Name and Address: National Aeronautics and Space Administration, Washington, D.C. 20546
13. Type of Report and Period Covered: Contractor report
15. Supplementary Notes: Langley Technical Monitor: Robert H. Tolson. Final Report.
17. Key Words (Suggested by Author(s)): Conjugate gradient, parallel computation, inner products
18. Distribution Statement: 61 Computer Programming and Software; Unclassified-Unlimited
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of Pages: 12
22. Price: A02

For sale by the National Technical Information Service, Springfield, Virginia 22161